Mixed Continuous and Categorical Flow Matching for 3D De Novo Molecule Generation

Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Diffusion models currently achieve state-of-the-art performance for 3D molecule generation. In this work, we explore the use of flow matching, a recently proposed generative modeling framework that generalizes diffusion models, for the task of de novo molecule generation. Flow matching provides flexibility in model design; however, the framework is predicated on the assumption of continuously-valued data. 3D de novo molecule generation requires jointly sampling continuous and categorical variables such as atom position and atom type. We extend the flow matching framework to categorical data by constructing flows that are constrained to exist on a continuous representation of categorical data known as the probability simplex. We call this extension SimplexFlow. We explore the use of SimplexFlow for de novo molecule generation. However, we find that, in practice, a simpler approach that makes no accommodations for the categorical nature of the data yields equivalent or superior performance. As a result of these experiments, we present FlowMol, a flow matching model for 3D de novo molecule generation that achieves improved performance over prior flow matching methods, and we raise important questions about the design of prior distributions for achieving strong performance in flow matching models. Code and trained models for reproducing this work are available at https://github.com/dunni3/FlowMol.


Introduction
Deep generative models that can directly sample molecular structures with desired properties have the potential to accelerate chemical discovery by reducing or eliminating the need to engage in resource-intensive, screening-based discovery paradigms. Moreover, generative models may also improve chemical discovery by enabling multi-objective design of chemical matter. In pursuit of this idea, there has been recent interest in developing generative models for the design of small-molecule therapeutics [1][2][3][4][5][6][7][8], proteins [9][10][11], and materials [12]. State-of-the-art performance in these tasks is presently achieved by applying diffusion models [13][14][15] to point cloud representations of molecular structures.
Flow matching, a recently proposed generative modeling framework [16][17][18][19], generalizes diffusion models. Under diffusion models, the transformation of prior samples to data is formulated as a reversal of a predefined forward process. The forward process is a Markov chain or differential equation that must converge to a tractable stationary distribution as $t \to \infty$; this requirement constrains the viable options for forward/reverse processes and prior distributions. In contrast, flow matching prescribes a method for directly learning a differential equation that maps samples between two nearly arbitrary distributions. In doing so, flow matching permits valuable flexibility when designing models for specific applications. For example, Jing et al. [20] and Stärk et al. [21] make use of the fact that flow matching allows arbitrary prior distributions to design models whose priors are closer to realistic 3D molecular conformations than a Gaussian prior.
In this work we explore the application of flow matching to 3D de novo small molecule generation. We adapt the approach of state-of-the-art diffusion models for this task [22][23][24] to the flow matching framework. This approach entails predicting atom positions, atom types (chemical elements), formal charges, and bond orders between all pairs of atoms. All of these variables are categorical with the exception of atom positions. Therefore, molecule generation requires sampling from a joint distribution of continuous and categorical variables.
[Figure] Top: An ordinary differential equation parameterized by a graph neural network transforms a prior distribution over atom positions, types, charges, and bond orders to the distribution of valid molecules. Black arrows show the instantaneous direction of the ODE on atom positions. Middle: Trajectory of the atom type vector for a single atom under SimplexFlow, a variant of flow matching developed for categorical variables. Atom type flows lie on the probability simplex. Bottom: Trajectory of an atom type vector starting from a Gaussian prior. This approach does not respect the categorical nature of the data; however, we find it yields superior performance to SimplexFlow.
Effectively adapting flow matching for this mixed continuous/categorical generative task may be non-trivial because the flow matching framework is predicated on the assumption of continuously valued data. In this work, we extend the flow matching framework to categorical data by constructing flows that are constrained to exist on a continuous representation of categorical data known as the probability simplex. We call this extension SimplexFlow. We present a model for de novo small-molecule generation that uses SimplexFlow to generate categorical features. This work was motivated by the intuition that designing a generative process that respects the categorical nature of the data it operates on may yield improved performance; however, our empirical results contradict this intuition. We show that in practice, a simpler approach that makes no accommodations for the categorical nature of the data yields superior performance to a de novo model using SimplexFlow. Our final flow matching model for molecule generation, FlowMol, achieves improved performance over existing flow matching methods for molecule generation and is competitive with state-of-the-art diffusion models while exhibiting a >10-fold reduction in inference time.

Discrete Diffusion
The original formulation of diffusion models [13] was defined in terms of a Markov chain of random variables that converged to a tractable stationary distribution in the limit of an infinite number of steps in the Markov chain. This formulation made no assumptions about the sample space of the random variables modeled, allowing for natural extensions to discrete data [25][26][27].
A separate formulation of diffusion models as continuous-time stochastic differential equations (SDEs) [15] became popular in the literature. The SDE formulation of diffusion models depends on the assumption of continuously-valued data. Similar to our approach, there is a line of work developing SDE-based diffusion models that operate on continuous representations of discrete data. Several works developed diffusion models where diffusion trajectories were constrained to the simplex [28][29][30]. An alternative approach is to embed categorical features into a continuous latent space and train diffusion models on the embeddings [31].
Hoogeboom et al. [44] proposed the first diffusion model for 3D molecule generation, which yielded superior performance over previous approaches. Molecules are represented in Hoogeboom et al. [44] by attributed point clouds where each atom has a position in space and a type. A continuous diffusion process is defined for both atom positions and types, where the prior for both is a standard Gaussian distribution. A purported weakness of this approach is that atom connectivity is not predicted by the model and must be inferred in a post-processing step. Several concurrent works sought to address this issue by predicting bond order in addition to atom positions/types: Huang et al. [22], Vignac et al. [23], Peng et al. [24], Hua et al. [45]. These models report substantially improved performance over Hoogeboom et al. [44]. Three of these four concurrent works (Vignac et al. [23], Peng et al. [24], Hua et al. [45]) use discrete diffusion processes for categorical features and attribute (in part) their improved model performance to the use of discrete diffusion; however, only Peng et al. [24] presents an ablation study isolating the effect of discrete diffusion. Moreover, Huang et al. [22] uses only continuous diffusion processes and reports superior performance. This suggests that while predicting graph connectivity provides performance benefits, the utility of discrete diffusion for molecule generation is less clear. Vignac et al. [23] and Huang et al. [22] fully specify the molecular structure by also predicting atom formal charges and the presence of hydrogen atoms; for this reason, these works are the most similar to the model presented here.

Flow-Matching for De Novo Molecule Generation
To our knowledge, Song et al. [46] is the only existing work that performs de novo molecule generation with flow matching. Molecules are represented as point clouds where each atom has a position in space and an atom type. The final molecule structure is inferred after the inference procedure. The prior distribution for atom type vectors is a standard Gaussian distribution, and so the generative process does not have any inductive biases to respect the discrete nature of the data. This work can be viewed as the flow matching analog of Hoogeboom et al. [44].

Flow Matching for Discrete Data
Concurrent work [47] developed a variant of flow matching on the simplex which we refer to as Dirichlet Flows. In Dirichlet Flows, conditional probability paths are only conditioned on $x_1$ and, as a result, do not permit arbitrary choices of the prior and must use a uniform distribution over the simplex. In contrast, our formulation permits the use of any prior distribution. Stark et al. [47] identify problems with the choice of commonly used conditional vector fields that limit performance on variables with a large number of categories. They propose an alternative choice of conditional probability paths that alleviates this issue.
There are also other works which develop flow matching variants for discrete data. Boll et al. [48] equip the simplex with the Fisher-Rao metric to form a Riemannian manifold, and apply Riemannian Flow Matching [49] to this manifold. Campbell et al. [50] develop a flow matching method for discrete data built on continuous-time Markov chains.
Importantly, none of the aforementioned works, which present methods for training flow matching models for categorical data, benchmark their model performance against simpler flow matching models that do not account for the categorical nature of their data.

Flow Matching
Flow matching [16][17][18][19] is a new generative modeling framework that generalizes diffusion models. Flow matching permits useful design flexibility in the choice of prior and in the nature of the map between two distributions. Flow matching is also conceptually simpler than diffusion and permits substantially faster inference. We briefly describe the flow matching framework here.
An ordinary differential equation (ODE) that exists on $\mathbb{R}^d$ is defined by a smooth, time-dependent vector field $u(x, t)$:

$$\frac{dx}{dt} = u(x, t) \tag{1}$$

Note that we only consider this ODE on the time interval $[0, 1]$. For simplicity we will use $u_t(x)$ interchangeably with $u(x, t)$. Given a probability distribution over initial positions $x_0 \sim p_0(x)$, the ODE (1) induces time-dependent probability distributions $p_t(x)$. The objective in flow matching is to approximate a vector field $u_t(x)$ that pushes a source distribution $p_0(x)$ to a desired target distribution $p_1(x)$. A neural network $u_\theta$ can be regressed to the vector field $u_t$ by minimizing the Flow Matching loss:

$$\mathcal{L}_{FM} = \mathbb{E}_{t,\, x \sim p_t(x)} \left\| u_\theta(x, t) - u_t(x) \right\|^2 \tag{2}$$
Computing $\mathcal{L}_{FM}$ requires access to $u_t$ and $p_t$, quantities that are typically intractable. Flow matching provides a method for approximating $u_t(x)$ without having access to it. If we consider the probability path $p_t(x)$ to be a mixture of conditional probability paths $p_t(x|z)$:

$$p_t(x) = \int p_t(x|z)\, p(z)\, dz \tag{3}$$

and we know the form of the conditional vector fields $u_t(x|z)$ that produce $p_t(x|z)$, then the marginal vector field $u_t(x)$ can be defined as a mixture of conditional vector fields:

$$u_t(x) = \int u_t(x|z)\, \frac{p_t(x|z)\, p(z)}{p_t(x)}\, dz \tag{4}$$

We still cannot compute $u_t(x)$, but the neural network $u_\theta$ that is the minimizer of $\mathcal{L}_{FM}$ is also the minimizer of the Conditional Flow Matching (CFM) loss defined in (5):

$$\mathcal{L}_{CFM} = \mathbb{E}_{t,\, z \sim p(z),\, x \sim p_t(x|z)} \left\| u_\theta(x, t) - u_t(x|z) \right\|^2 \tag{5}$$

That is, regressing to conditional vector fields, in expectation, is equivalent to regressing to the marginal vector field. The remaining design choices for a flow matching model are the choice of conditioning variable $z$, conditional probability paths $p_t(x|z)$, and conditional vector fields $u_t(x|z)$.
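To make the CFM objective concrete, the following is a minimal sketch (not the authors' implementation) that estimates it by Monte Carlo for scalar data. It assumes the conditioning variable is a pair of endpoints $z = (x_0, x_1)$ and a linear path between them, so that the conditional vector field is simply $x_1 - x_0$. The names `cfm_loss`, `sample_x0`, and `sample_x1` are illustrative.

```python
import random


def cfm_loss(u_theta, sample_x0, sample_x1, n_samples=1000, seed=0):
    """Monte Carlo estimate of the CFM loss for scalar data.

    Assumptions (for illustration only): z = (x0, x1), the path is
    x_t = (1 - t) * x0 + t * x1, and therefore u_t(x | z) = x1 - x0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.random()                    # t ~ U[0, 1)
        x0 = sample_x0(rng)                 # draw from the prior p_0
        x1 = sample_x1(rng)                 # draw from the target p_1
        x_t = (1.0 - t) * x0 + t * x1       # sample from p_t(x | z)
        target = x1 - x0                    # conditional vector field
        total += (u_theta(x_t, t) - target) ** 2
    return total / n_samples
```

For example, with a standard Gaussian prior and a target concentrated at the single point 2, the field $u(x, t) = (2 - x)/(1 - t)$ reproduces every conditional target exactly and so attains a loss of (numerically) zero, whereas the zero field does not.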

Problem Setting
We represent a molecule with $N$ atoms as a fully-connected graph. Each atom is a node in the graph. Every atom has a position in space, $X = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{N \times 3}$, an atom type (in this case the atomic element), $A = \{a_i\}_{i=1}^{N} \in \mathbb{R}^{N \times n_a}$, and a formal charge, $C = \{c_i\}_{i=1}^{N} \in \mathbb{R}^{N \times n_c}$. Additionally, every pair of atoms has a bond order $e_{ij} \in \mathbb{R}^{n_e}$, collected in $E = \{e_{ij}\}_{i \neq j}$. Here $n_a$, $n_c$, $n_e$ are the number of possible atom types, charges, and bond orders; these are categorical variables represented by one-hot vectors. For brevity, we denote a molecule by the symbol $g$, which can be thought of as a tuple of the constituent data types $g = (X, A, C, E)$.
There is no closed-form expression or analytical technique for sampling the distribution of realistic molecules $p(g)$. We seek to train a flow matching model to sample this distribution. Concretely, we choose the target distribution to be the distribution of valid 3D molecules, $p_1(g) = p(g)$. Our choice of prior $p_0(g)$ is described in Section 3.5.
Our strategy for adapting flow matching to molecular structure mimics prior work on applying diffusion and flow-based generative models to molecular structure. That is, we define conditional vector fields and conditional probability paths for each data modality and jointly regress one neural network for all data modalities. Our total loss is a weighted combination of CFM losses from (5):

$$\mathcal{L} = \eta_X \mathcal{L}_X + \eta_A \mathcal{L}_A + \eta_C \mathcal{L}_C + \eta_E \mathcal{L}_E \tag{6}$$

where $(\eta_X, \eta_A, \eta_C, \eta_E)$ are scalars weighting the relative contribution of each loss term. We set these values to (3, 0.4, 1, 2) as was done in Vignac et al. [23]. Our specific choice of conditional vector fields and probability paths is described in Section 3.2. In practice, we use a variant of the CFM objective called the endpoint-parameterized objective that we present in Section 3.3. These choices are in turn used to design SimplexFlow, our method of performing flow matching for categorical variables, which is described in Section 3.4.

Flow Matching with Temporally Non-Linear Interpolants
We choose the conditioning variable to be the initial and final states of a trajectory: $z = (g_0, g_1)$. We choose the conditional probability path to be a Dirac density placed on a "straight" line connecting these states: $p_t(g|g_0, g_1) = \delta(g - (1 - \alpha_t) g_0 - \alpha_t g_1)$. This particular choice of conditional vector fields and probability paths gives us the freedom to choose any prior distribution $p_0(g)$ [17,18,36]. Our choice of $p_t(g|g_0, g_1)$ is equivalent to defining a deterministic interpolant:

$$g_t = (1 - \alpha_t)\, g_0 + \alpha_t\, g_1 \tag{7}$$

where $\alpha_t : [0, 1] \to [0, 1]$ is a function that takes $t$ as input and returns a value between 0 and 1. The rate at which a molecule from the prior distribution $g_0$ is transformed into a valid molecule $g_1$ can be controlled by the choice of $\alpha_t$, which we name the "interpolant schedule." We define separate interpolant schedules for each data type comprising a molecule: $\alpha_t = (\alpha_t^X, \alpha_t^A, \alpha_t^C, \alpha_t^E)$. Taking inspiration from Vignac et al. [23], we define a cosine interpolant schedule:

$$\alpha_t = 1 - \cos^2\left(\frac{\pi}{2}\, t^{\nu}\right) \tag{8}$$

where different values of $\nu$ are set for atom positions, types, charges, and bond orders. The interpolant (7) gives rise to conditional vector fields of the form:

$$u(g_t | g_0, g_1) = \alpha'_t\, (g_1 - g_0) \tag{9}$$

where $\alpha'_t$ is the time derivative of $\alpha_t$.
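The interpolant and its conditional vector field can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: it assumes the cosine schedule has the form $\alpha_t = 1 - \cos^2(\frac{\pi}{2} t^{\nu})$, represents a data modality as a flat list of floats, and uses hypothetical function names.

```python
import math


def alpha(t, nu=1.0):
    """Cosine interpolant schedule (assumed form): alpha(0) = 0, alpha(1) = 1."""
    return 1.0 - math.cos(0.5 * math.pi * t ** nu) ** 2


def alpha_prime(t, nu=1.0):
    """Time derivative of alpha. For nu < 1 this is only defined for t > 0."""
    s = 0.5 * math.pi * t ** nu                    # inner argument s(t)
    ds_dt = 0.5 * math.pi * nu * t ** (nu - 1.0)   # ds/dt
    return math.sin(2.0 * s) * ds_dt               # d/dt [1 - cos^2 s] = sin(2s) ds/dt


def interpolate(g0, g1, t, nu=1.0):
    """Deterministic interpolant g_t = (1 - alpha_t) g0 + alpha_t g1."""
    a = alpha(t, nu)
    return [(1.0 - a) * x0 + a * x1 for x0, x1 in zip(g0, g1)]


def conditional_vector_field(g0, g1, t, nu=1.0):
    """Conditional vector field alpha'_t (g1 - g0) induced by the interpolant."""
    da = alpha_prime(t, nu)
    return [da * (x1 - x0) for x0, x1 in zip(g0, g1)]
```

Larger values of `nu` delay the transition toward `g1`, which is one way to give each modality (positions, types, charges, bond orders) its own pace.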

Endpoint Parameterization
By solving (7) for $g_0$ and substituting this expression into (9), we obtain an alternate form of the conditional vector field:

$$u(g_t | g_0, g_1) = \frac{\alpha'_t}{1 - \alpha_t}\, (g_1 - g_t) \tag{10}$$
As described in Section 2.5, the typical flow matching procedure is to regress a neural network $u_\theta(g_t)$ directly to conditional vector fields by minimizing the CFM loss (5). Instead, we apply a reparameterization initially proposed by Jing et al. [20]:

$$u_\theta(g_t) = \frac{\alpha'_t}{1 - \alpha_t} \left( \hat{g}_1(g_t) - g_t \right) \tag{11}$$

By substituting (11) and (10) into (5), we obtain our endpoint-parameterized objective:

$$\mathcal{L}_{EP} = \mathbb{E}_{t,\, (g_0, g_1),\, g_t \sim p_t(g|g_0, g_1)} \left\| \frac{\alpha'_t}{1 - \alpha_t} \left( \hat{g}_1(g_t) - g_1 \right) \right\|^2 \tag{12}$$

Therefore, our objective becomes training a neural network $\hat{g}_1(g_t)$ that predicts valid molecular structures given samples from a conditional probability path. This is particularly advantageous when operating on categorical data, as placing a softmax layer on model outputs constrains the domain of model outputs to the simplex. Empirically, we find that the endpoint objective yields better performance than the vector field regression objective (5) for the task of molecule generation. Moreover, we leverage the theoretical guarantee that our predicted endpoint for categorical data lies on the simplex to ensure our flows lie on the simplex.
In practice, the interpolant-dependent loss weight $\frac{\alpha'_t}{1 - \alpha_t}$ produces unreasonably large values as $\alpha_t \to 1$. We replace this term with a time-dependent loss weight inspired by Le et al. [51]: $w(t) = \min\left(\max\left(0.005, \frac{\alpha_t}{1 - \alpha_t}\right), 1.5\right)$. For categorical variables we use a cross-entropy loss rather than the L2 norm shown in (12).
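The clamped weight is simple to express in code. The sketch below is illustrative: the function names are hypothetical, and the way the weight is combined with a per-atom cross-entropy term is an assumption about how such a weight would typically be applied, not a description of the released implementation.

```python
import math


def endpoint_loss_weight(alpha_t, w_min=0.005, w_max=1.5):
    """w(t) = min(max(w_min, alpha_t / (1 - alpha_t)), w_max)."""
    if alpha_t >= 1.0:
        return w_max  # the unclamped ratio diverges as alpha_t -> 1
    return min(max(w_min, alpha_t / (1.0 - alpha_t)), w_max)


def weighted_cross_entropy(predicted_probs, target_index, alpha_t):
    """Weighted cross-entropy for one categorical variable (assumed usage).

    predicted_probs: softmax output of the endpoint predictor.
    target_index: index of the true category in the one-hot target.
    """
    return -endpoint_loss_weight(alpha_t) * math.log(predicted_probs[target_index])
```

The lower bound keeps early timesteps from being ignored entirely, while the upper bound prevents late timesteps from dominating the gradient.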

SimplexFlow
To design flow matching for categorical data, our strategy is to define a continuous representation of categorical variables, and then construct a flow matching model where flows are constrained to this representation. We choose the $d$-dimensional probability simplex $S^d$ as the continuous representation of a $d$-categorical variable.
A $d$-categorical variable $x_1 \in \{1, 2, \ldots, d\}$ can be converted to a point on $S^d$ via one-hot encoding. Correspondingly, the categorical distribution $p_1(x) = \mathcal{C}(q)$ can be converted to a distribution on $S^d$ as:

$$p_1(x) = \sum_{i=1}^{d} q_i\, \delta(x - e_i)$$

where $e_i$ is the $i$-th vertex of the simplex and $q_i$ is the probability of $x$ belonging to the $i$-th category. If we choose a prior distribution $p_0(x)$ such that $\mathrm{supp}(p_0) = S^d$, then all conditional probability paths produced by the interpolant (7) will lie on the simplex. This is because the simplex is closed under linear interpolation (see Appendix A) and the conditional trajectories are obtained by linearly interpolating between two points on the simplex ($x_0, x_1 \in S^d$).
Although choosing $(p_0, p_1)$ with support on the simplex results in conditional trajectories on the simplex, training a flow under the vector field objective (5) provides no guarantee that trajectories produced by the learned vector field lie on the simplex. However, training a flow matching model under the endpoint parameterization (Section 3.3) enables us to guarantee by construction that generated flows lie on the simplex; proof of this is provided in Appendix B.
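One informal way to see why the endpoint parameterization keeps trajectories on the simplex is that an Euler step becomes a convex combination of the current point and the softmax-constrained prediction. The sketch below illustrates this under simplifying assumptions that are not taken from the paper: a linear schedule $\alpha_t = t$, a step coefficient clamped to at most 1, and a stand-in predictor supplied by the caller. It is not a substitute for the proof in Appendix B.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the output always lies on the simplex."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def on_simplex(x, tol=1e-9):
    """True if x is non-negative and sums to one, up to tolerance."""
    return all(v >= -tol for v in x) and abs(sum(x) - 1.0) <= tol


def euler_trajectory(x0, predict_endpoint, n_steps=100):
    """Euler integration of u = (x1_hat - x_t) / (1 - t), i.e. alpha_t = t.

    Each update is x <- (1 - c) x + c x1_hat with c = dt / (1 - t). Since
    0 <= c <= 1, the update is a convex combination of two simplex points.
    """
    x, trajectory = list(x0), [list(x0)]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x1_hat = predict_endpoint(x, t)       # assumed to return a simplex point
        c = min(1.0, dt / (1.0 - t))          # clamp is an illustrative safeguard
        x = [(1.0 - c) * xi + c * hi for xi, hi in zip(x, x1_hat)]
        trajectory.append(x)
    return trajectory
```

With a predictor that always returns `softmax([2.0, 0.0, -1.0])` and a start at the barycenter, every point of the trajectory stays on the simplex and the final point coincides with the prediction.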

Priors
We define the prior distribution for a molecule as a composition of independent samples for each atom and pair of atoms. Our prior distributions take the form:

$$p_0(g) = \prod_{i=1}^{N} p_0(x_i)\, p_0(a_i)\, p_0(c_i) \prod_{i \neq j} p_0(e_{ij}) \tag{15}$$

Our choice of conditional trajectory (7) permits the choice of any prior distribution. SimplexFlow places the constraint that the prior distribution for categorical variables has support bounded to the simplex.
We always set $p_0(x_i) = \mathcal{N}(x_i | 0, I)$; atom positions are independently sampled from a standard Gaussian distribution.
We explore the use of several prior distributions for categorical variables $a_i$, $c_i$, $e_{ij}$. We experiment with three different categorical priors for SimplexFlow. The uniform-simplex prior is a uniform distribution over the simplex, the simplest choice for a categorical prior. This choice is analogous to the "Linear FM" model described in [47]. The marginal-simplex prior is designed to be "closer" to the data distribution by using marginal distributions observed in the training data. Specifically, we replace $p_0(a_i) p_0(c_i)$ and $p_0(e_{ij})$ in (15) with $p_1(a_i, c_i)$ and $p_1(e_{ij})$, respectively. Finally, for the barycenter prior, categorical variables are placed at the barycenter of the simplex: the point in the center of the simplex that assigns equal probability to all categories. The intuition behind the barycenter prior is that all categorical variables will be "undecided" at $t = 0$.
In practice, the model fails when the prior distributions for categorical variables only have density on a small, fixed number of points on the simplex; this is the case for the marginal-simplex and barycenter priors. We find that "blurring" the prior samples for categorical variables significantly improves performance. That is, Gaussian noise is added to the samples before they are projected back onto the simplex.
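The three simplex priors and the blurring step might be sketched as follows. Several details here are assumptions, since the text does not specify them: the projection is taken to be the Euclidean projection onto the simplex (a standard sort-based algorithm), the noise scale `sigma` is a hypothetical parameter, and all function names are illustrative.

```python
import random


def sample_uniform_simplex(d, rng):
    """Uniform sample on the simplex via normalized exponentials (Dirichlet(1,...,1))."""
    e = [rng.expovariate(1.0) for _ in range(d)]
    total = sum(e)
    return [v / total for v in e]


def sample_barycenter(d):
    """The point assigning equal probability to all d categories."""
    return [1.0 / d] * d


def sample_marginal_vertex(marginal_probs, rng):
    """One-hot vertex drawn from marginal category frequencies."""
    idx = rng.choices(range(len(marginal_probs)), weights=marginal_probs)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(marginal_probs))]


def project_to_simplex(v):
    """Euclidean projection onto the simplex (assumed choice of projection)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - 1.0) / j
        if uj - candidate > 0:
            theta = candidate        # keep the threshold from the last valid j
    return [max(vi - theta, 0.0) for vi in v]


def blur(x, sigma, rng):
    """Add Gaussian noise, then project back onto the simplex."""
    return project_to_simplex([vi + rng.gauss(0.0, sigma) for vi in x])
```

Blurring turns the single point of the barycenter prior, or the `d` vertices of the marginal-simplex prior, into a distribution with density over a region of the simplex.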

Optimal Transport Alignment
Previous work [17] has shown that aligning prior and target samples via optimal transport significantly improves the performance of flow matching by minimizing the extent to which conditional trajectories intersect. When performing flow matching on molecular structure, this consists of computing the optimal permutation of node ordering and the rigid-body alignment of atom positions [46,52]. We apply the same alignment between target and prior positions at training time. This also ensures that prior positions $p_0(X)$ and target positions $p_1(X)$ effectively exist in the center-of-mass-free subspace proposed in Hoogeboom et al. [44] that renders the target density $p_1(g)$ invariant to translations.

Model Architecture
Molecules are treated as fully-connected graphs. The model is designed to accept a sample $g_t$ and predict the final destination molecule $g_1$. Within the neural network, molecular features are grouped into node positions, node scalar features, node vector features, and edge features. Node positions are identical to the atom positions discussed in Section 3.1. Node scalar features are a concatenation of atom type and atom charge. Node vector features are geometric vectors (vectors with rotation order 1) that are relative to the node position. Node vector features are initialized to zero vectors. Molecular features are iteratively updated by passing $g_t$ through several Molecule Update Blocks. A Molecule Update Block uses Geometric Vector Perceptrons (GVPs) [53] to handle vector features. Molecule Update Blocks are composed of three components: a node feature update (NFU), a node position update (NPU), and an edge feature update (EFU). The NFU uses a message-passing graph convolution to update node features. The NPU and EFU blocks are node-wise and edge-wise operations, respectively. Following several Molecule Update Blocks, predictions of the final categorical features $(\hat{A}_1, \hat{C}_1, \hat{E}_1)$ are generated by passing node and edge features through shallow node-wise and edge-wise multilayer perceptrons (MLPs). For models using endpoint parameterization, these MLPs include softmax activations. The model architecture is visualized in Figure 2 and explained in detail in Appendix D.
In practice, graphs are directed. For every pair of atoms $i, j$ there exist edges in both directions: $i \to j$ and $j \to i$.
When predicting the final bond orders $\hat{E}_1$ for an edge, we ensure that one prediction is made per pair of atoms and that this prediction is invariant to permutations of the atom indexing. This is accomplished by making our prediction from the sum of the learned bond features. That is, $\hat{e}^{ij}_1 = \mathrm{MLP}(e_{ij} + e_{ji})$.
GVPs, as they were originally designed, predict vector quantities that are E(3)-equivariant. We introduce a variant of GVP that is made SE(3)-equivariant by the addition of cross product operations. The cross product is equivariant to rotations and translations of input vectors but not reflections. As a result, the learned density $p_1(g)$ is invariant to rotations and translations but not reflections. In other words, FlowMol is sensitive to chirality. Empirically, we find that the addition of cross product operations to GVP improves performance. Schneuing et al. [3] proposed the addition of a cross product operation to the EGNN architecture [54]; we adopt this idea for GVP. We refer the reader to Appendix F of Schneuing et al. [3] for a detailed discussion of the equivariance of cross products. Our cross product variant of GVP is described in Appendix D.1.
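The behavior of the cross product under rotations and reflections can be checked directly. The short sketch below uses plain Python lists and hypothetical helper names; it illustrates the general identity $(Ma) \times (Mb) = \det(M)\, M (a \times b)$ for orthogonal $M$, and is independent of the GVP implementation.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matvec(m, v):
    """Apply a 3x3 matrix (list of rows) to a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


ROTATE_Z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]   # proper rotation, det = +1
REFLECT_X = [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]     # reflection, det = -1

a, b = [1, 2, 3], [4, 5, 6]

# Rotation: transforming the inputs equals transforming the output.
rotated_inputs = cross(matvec(ROTATE_Z_90, a), matvec(ROTATE_Z_90, b))
rotated_output = matvec(ROTATE_Z_90, cross(a, b))

# Reflection: the two differ by a sign, so equivariance is broken.
reflected_inputs = cross(matvec(REFLECT_X, a), matvec(REFLECT_X, b))
reflected_output = matvec(REFLECT_X, cross(a, b))
```

Here `rotated_inputs` equals `rotated_output`, while `reflected_inputs` is the negation of `reflected_output`. This sign flip is what allows a network built with cross products to distinguish a structure from its mirror image.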

Datasets
We train on QM9 [55,56] and GEOM-Drugs [57] using explicit hydrogens. QM9 contains 124k small molecules, each with one 3D conformation. GEOM-Drugs contains approximately 300k larger, drug-like molecules with multiple conformers for each molecule. Molecules in QM9 have an average of 18 atoms and a maximum of 29, while those in GEOM-Drugs have an average of 44 atoms and a maximum of 181. We use the same dataset splits as Vignac et al. [23].
We chose to use explicit hydrogens because it is a more difficult learning task. By predicting explicit hydrogens in combination with atom types, bond orders, and formal charges, there is a 1-to-1 mapping from model outputs to molecules. If any one of these components were removed from the generative task, one model output could plausibly be interpreted as multiple molecular structures, and so it is "easier" for the model output to be interpreted as "correct" or "valid." We view the task of predicting graph topology and structure with explicit hydrogens and formal charges as the most rigorous evaluation of the capabilities of generative models to fit the distribution of valid molecular structures.

Model Evaluation
We report three metrics measuring the validity of generated molecular topology: percent atoms stable, percent molecules stable, and percent molecules valid. An atom is defined as "stable" if it has valid valency. Atomic valency is defined as the sum of the bond orders of all bonds in which an atom participates. Aromatic bonds are assigned a bond order of 1.5. A valid valency is defined as any valency that is observed in the training data for atoms of a given element and formal charge. A molecule is counted as stable if all of its constituent atoms are stable. A molecule is considered "valid" if it can be sanitized by RDKit [58] using default sanitization settings.
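The atom-level and molecule-level stability definitions above can be sketched as follows. The data layout (lists of elements, charges, and `(i, j, order)` bond tuples) and the contents of the allowed-valency table are hypothetical; in the paper the table is derived from the training data.

```python
def atom_valencies(n_atoms, bonds):
    """Sum of bond orders per atom; aromatic bonds should be passed as 1.5."""
    valency = [0.0] * n_atoms
    for i, j, order in bonds:
        valency[i] += order
        valency[j] += order
    return valency


def stability(elements, charges, bonds, allowed_valencies):
    """Return (per-atom stability flags, whether the molecule is stable).

    allowed_valencies maps (element, formal_charge) to the set of valencies
    considered valid for that combination.
    """
    valency = atom_valencies(len(elements), bonds)
    atom_stable = [
        v in allowed_valencies.get((element, charge), set())
        for element, charge, v in zip(elements, charges, valency)
    ]
    return atom_stable, all(atom_stable)


# Hypothetical table for illustration only.
ALLOWED = {("C", 0): {4.0}, ("H", 0): {1.0}}

# Methane: one carbon (index 0) singly bonded to four hydrogens.
methane_bonds = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0), (0, 4, 1.0)]
methane = stability(["C", "H", "H", "H", "H"], [0] * 5, methane_bonds, ALLOWED)
```

Because molecule stability requires every atom to be stable, a high atom-level stability can coexist with a much lower molecule-level stability for large molecules.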
Metrics regarding the validity of molecular topology fail to capture a model's ability to reproduce reasonable molecular geometries. Therefore, we also compute the Jensen-Shannon divergence between the distribution of potential energies for molecules in the training data and that for molecules sampled from trained models. Potential energies are obtained from the Merck Molecular Mechanics Force Field implemented in RDKit [58]. Force-field energy cannot be obtained for molecules that cannot be sanitized by RDKit, and so the reported Jensen-Shannon divergences are for valid molecules only.
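A Jensen-Shannon divergence between two sets of energies can be estimated from histograms, as in the sketch below. The binning scheme, the clamping of out-of-range values, and the use of base-2 logarithms (which bounds the divergence by 1) are assumptions made for illustration; the text does not specify these details.

```python
import math


def histogram(values, lo, hi, n_bins):
    """Normalized histogram; out-of-range values are clamped into the edge bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        k = int((v - lo) / width)
        counts[min(max(k, 0), n_bins - 1)] += 1
    return [c / len(values) for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Both energy sets should be binned with the same `lo`, `hi`, and `n_bins` so that the two histograms are defined over the same support.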
Molecule quality metrics are reported for samples of 10,000 molecules, repeated 5 times. We report inference time for FlowMol and baseline models. We measure inference time as the time required to generate one batch of 100 molecules on the same NVIDIA GeForce RTX 2060 GPU. This inference procedure is also repeated five times. FlowMol inference uses Euler integration with 100 evenly spaced timesteps. All results are reported with 95% confidence intervals. For all samplings, the number of atoms in each molecule is sampled from the distribution of atom counts in molecules from the training data.

Model Ablations
We train multiple versions of our model to evaluate the effects of several aforementioned design choices. To observe the effect of endpoint reparameterization (Section 3.3), we train equivalent models with both the vector-field objective (5) and the endpoint objective (12). We train models using SimplexFlow with all three categorical priors proposed in Section 3.5 which have support on the simplex. To determine whether SimplexFlow improves performance, we also train models where the prior distribution for categorical features is a standard Gaussian distribution. In this setting, the generated flows are not constrained to the simplex, and it can be said that the flows do not "respect" the categorical nature of the data. This is similar to the atom type flows in Song et al. [46] and atom type diffusion in Hoogeboom et al. [44].
All of the mentioned model ablations are tested on the QM9 dataset and the results are presented in Section 5.1. A subset of these ablations was also performed on the GEOM dataset. GEOM ablations are available in Appendix E. None of the effects observed in GEOM ablations contradict those seen for QM9 ablations. For metrics reported in ablations, results are averaged over two identical models trained with different random seeds.

Comparison to Dirichlet Flows
We compare SimplexFlow to concurrent work that developed Dirichlet Flows [47] for flow matching on the simplex. Briefly, for a $d$-categorical variable $x$ represented as a point on the simplex, the conditional probability path $p_t(x | x_1 = e_i)$ is a Dirichlet distribution, Dir, whose parameter vector $\gamma$ depends on $\omega$, which represents time. The Dirichlet conditional flow must start at $\omega = 1$ and only converges to $\delta(x - e_i)$ in the limit $\omega \to \infty$. In order to incorporate Dirichlet flows into our model, we define the relation $\omega_t = \omega_{max}\, \alpha_t + 1$, where $\alpha_t$ is defined by (8). Dirichlet flow matching necessitates the use of a uniform prior over the simplex for categorical variables, and so we do not experiment with the other simplex priors described in Section 3.5.

Baselines
We compare FlowMol to three baselines: MiDi [23], JODO [22], and EquiFM [46]. MiDi and JODO perform the same generation task: predicting atom positions, atom types, formal charges, and bond orders. The key difference from FlowMol is that MiDi and JODO are diffusion models. EquiFM, as described in Section 2.3, is a flow matching model for de novo molecule generation; however, the model does not predict bond orders or atomic charges. We do not report the performance of EquiFM on the GEOM dataset because the authors have not released a model checkpoint.

Model Ablations
Results of model ablation experiments on the QM9 dataset are shown in Table 1. Most notably, models that use SimplexFlow for categorical variables (those with categorical priors constrained to the simplex) consistently underperform models with Gaussian categorical priors. The best performing SimplexFlow model (endpoint parameterization, marginal-simplex prior) achieves 96.1% valid molecules while an equivalent model using a Gaussian prior achieves 96.9% valid molecules.
Models trained under the endpoint objective achieve superior performance to otherwise identical models trained under the vector-field objective. For example, Table 1 shows that a model trained with a marginal-simplex categorical prior obtains 79% stable molecules under the vector-field objective and 92% stable molecules under the endpoint objective. This effect is also observed with models using a Gaussian categorical prior, but to a lesser extent.
We find that models using Dirichlet conditional probability paths [47] yield approximately equivalent performance to the conditional probability path (7) with a uniform-simplex categorical prior. Among models satisfying the constraints of SimplexFlow (Section 3.4), the uniform-simplex prior yielded the worst performance. The marginal-simplex and barycenter priors yield approximately equivalent performance. Although the models using marginal-simplex and barycenter priors produce relatively fewer valid molecules, the molecules generated by these models exhibit the lowest Jensen-Shannon divergence to the energy distribution of the training data.

Comparison with Baselines
FlowMol achieves superior performance to EquiFM [46] on QM9; for example, it produces 3% more valid molecules while having equivalent divergence to the training data energy distribution. FlowMol approaches the performance of diffusion baselines (JODO, MiDi) on QM9 but does not perform as well on the GEOM-Drugs dataset. The fact that fewer generated molecules are valid on the GEOM-Drugs dataset cannot be attributed solely to the difference in molecule sizes between the two datasets, because FlowMol's atom-level stability is also worse for GEOM-Drugs than QM9 (99.0% on GEOM vs 99.7% on QM9). Despite the fact that MiDi and FlowMol achieve equivalent atom-level stability (99.0%), MiDi produces significantly more topologically correct molecules. For example, FlowMol achieves 68% stable molecules while MiDi achieves 85%. FlowMol exhibits substantially faster inference times than all baseline models. This difference is primarily due to the smaller number of integration steps needed by FlowMol. We find empirically that sample quality does not improve when using more than 100 integration steps. JODO, MiDi, and EquiFM use 1000 integration steps by default. The need for fewer integration steps than diffusion models is a recognized advantage of flow matching models over diffusion [17,19].

Discussion
FlowMol improves upon the existing state of the art flow matching method for molecule generation; however, it still does not outperform diffusion models trained for the same task. A key difference between FlowMol and the diffusion baselines presented here is that the conditional trajectories are deterministic in FlowMol and stochastic in diffusion models. Prior works have presented theoretical [18] and empirical [21] evidence that stochastic conditional trajectories yield improved model performance.
Our results raise interesting questions about the design of prior distributions for flow matching models. Our intuition was that a stronger prior that is "closer" to the data distribution would yield more faithful recapitulation of the target distribution. The results of our model ablations suggest this intuition is incorrect. The next natural questions are: why is a Gaussian prior the most performant of those tested here, and what are the qualities of a prior that best enable recapitulation of the target distribution? A possible explanation for our results is a dependence on the "volume" of the prior. Empirically, when the prior for categorical features has support on a small number of unique values, the model fails to produce any valid molecules. Adding a "blur" as described in Section 3.5 dramatically improves model performance. Correspondingly, priors constrained to the simplex reliably yield poorer performance than Gaussian priors. These observations could all be explained through the perspective of the prior's capacity for serving as one domain of a homeomorphism to a more complex distribution.
Another explanation for the superiority of Gaussian priors may involve the shape of conditional trajectories induced by the prior. Conditional trajectories are more likely to intersect when constrained to a smaller space, such as the simplex. This explanation is also supported by the observation that the marginal-simplex and barycenter priors yield substantially improved performance over uniform-simplex priors. Tong et al. [17] suggest that sampling conditional pairs $(g_0, g_1)$ from an optimal transport (OT) alignment $\pi(g_0, g_1)$ improves performance precisely because the marginal vector field yields straighter lines with fewer intersections. In this work, an OT plan is computed but only for atomic positions. Perhaps computing an OT alignment over the product space of all the data modalities represented here could alleviate this issue.
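As a rough illustration of what an OT alignment over atomic positions involves, the sketch below reorders a set of prior positions so that the summed squared distance to the target positions is minimal. It is not the FlowMol implementation: the function names are hypothetical, the search is brute force over permutations (usable only for a handful of atoms), and any rigid-body alignment is omitted. For realistic molecule sizes an assignment solver such as SciPy's `linear_sum_assignment` would replace the exhaustive search.

```python
import itertools
import math


def squared_cost(x0, x1):
    """Sum of squared Euclidean distances between paired points."""
    return sum(math.dist(a, b) ** 2 for a, b in zip(x0, x1))


def ot_align(x0, x1):
    """Reorder prior positions x0 to minimize the transport cost to x1.

    For two point clouds of equal size with uniform weights, an optimal
    transport plan can be chosen to be a permutation, so an exhaustive
    search over permutations finds it. Illustrative only: cost is O(n!).
    """
    best = min(itertools.permutations(x0), key=lambda p: squared_cost(p, x1))
    return list(best)


# Hypothetical example: three "atoms" on a line.
x1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]  # target positions
x0 = [(2.1, 0.0, 0.0), (0.1, 0.0, 0.0), (0.9, 0.0, 0.0)]  # prior sample
aligned = ot_align(x0, x1)
```

After alignment each prior point is paired with its nearest target, so the straight-line conditional trajectories between the pairs no longer cross in this example.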

Conclusions
FlowMol is the first generative model to jointly sample the topological and geometric structure of small molecules. FlowMol improves upon existing flow matching models for molecule generation and achieves competitive performance with diffusion-based models while exhibiting inference speeds an order of magnitude faster. We present a method for flow matching on categorical variables, SimplexFlow, and demonstrate that constraining flows to a smaller space does not yield performance benefits. We think this result raises interesting and relevant questions about the design of flow matching for mixed continuous/categorical generative tasks and provide potential hypotheses to begin exploring in future work.

$\hat{y}_1(y_t), \quad y_t \in S^d$
The first condition is obviously true.
The second condition can be written as two inequalities, $0 \le \frac{\alpha'_t (s - t)}{1 - \alpha_t} \le 1$. The first inequality reduces to $\alpha'_t \ge 0$; the interpolant must be monotonically increasing. The second inequality can be seen as an upper bound on the step size that can be used during integration:

$$s - t \le \frac{1 - \alpha_t}{\alpha'_t} \tag{24}$$

For well-behaved interpolant schedules and many reasonable choices of interpolant, this inequality is satisfied in the limit as $s - t \to 0$.
Regarding the third condition: by definition, $\hat{y}_1 \in S^d$; this is practically enforced by placing softmax activations on the output of the neural network. If $y_0 \in S^d$, a condition which is guaranteed by our choice of prior $p_0(y)$, then the first application of the update rule (23) would satisfy all three conditions and as a result $y_{0+\Delta t} \in S^d$. By induction, every subsequent application of the update rule (23) would yield an integration step that is a linear interpolation between two points on the simplex.
As a result, all trajectories generated by the ODE will lie on the simplex in the limit of an infinitely small integration step. In practice, infinitely small integration steps are not necessary to yield trajectories on the simplex; this is discussed further in Appendix C.
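The argument can be checked numerically. The sketch below assumes that the update rule (23) has the form $y_s = y_t + c\,(\hat{y}_1 - y_t)$ with $c = \alpha'_t (s - t) / (1 - \alpha_t)$, which is the form implied by the three conditions; the function names and the example values are illustrative and are not taken from the FlowMol code.

```python
def simplex_euler_step(y_t, y1_hat, alpha_t, dalpha_t, dt):
    """One Euler step written as a convex combination of y_t and y1_hat.

    Assumed form of the update: y_s = y_t + c * (y1_hat - y_t),
    with c = dalpha_t * dt / (1 - alpha_t). The step is rejected when
    c falls outside [0, 1], i.e. when the second condition is violated.
    """
    c = dalpha_t * dt / (1.0 - alpha_t)
    if not 0.0 <= c <= 1.0:
        raise ValueError(f"step coefficient {c:.3f} is outside [0, 1]")
    return [(1.0 - c) * a + c * b for a, b in zip(y_t, y1_hat)]


def on_simplex(y, tol=1e-9):
    """True if all entries are non-negative and they sum to one."""
    return all(v >= -tol for v in y) and abs(sum(y) - 1.0) <= tol


# Hypothetical values: a point drawn from a simplex prior and a
# softmax-style endpoint prediction for three atom types.
y_t = [0.2, 0.5, 0.3]
y1_hat = [0.9, 0.05, 0.05]
y_s = simplex_euler_step(y_t, y1_hat, alpha_t=0.25, dalpha_t=1.5, dt=0.01)
```

Because the step coefficient here is 0.02, the new point is a convex combination of two points on the simplex and therefore remains on it. A step that is too large for the current value of $\alpha_t$ raises an error instead of silently leaving the simplex.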

C Interpolation Schedules and Integration Step Sizes
The relative rates at which molecular features are generated are determined by setting values of the parameter $\nu$ in the cosine interpolant schedule (8). For the QM9 dataset we set $\nu = (\nu_X, \nu_A, \nu_C, \nu_E) = (1, 2, 2, 1.5)$. For the GEOM-Drugs dataset this is set to $(1, 2, 2, 2)$. These are the same values used for the cosine noise schedule in Vignac et al. [23]. The interpolant schedules used for the QM9 dataset are plotted in Figure 3. In Appendix B we derive an upper bound on the integration step size that guarantees SimplexFlow trajectories will remain on the simplex (24). In Figure 4 we plot this maximum step size as a function of $t$ for cosine interpolation schedules (8). For the results presented in this paper we sample molecules by performing Euler integration with 100 evenly-spaced integration steps. This corresponds to a constant step size of $10^{-2}$. According to Figure 4, this step size ensures trajectories will remain on the simplex until approximately $t = 0.98$.
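The relationship between the schedule and the maximum step size can be sketched numerically. The code below assumes that the cosine interpolant schedule (8) has the form $\alpha_t = 1 - \cos^2\left(\frac{\pi}{2} t^{\nu}\right)$; equation (8) is not reproduced in this appendix, so this form and all function names should be read as assumptions. The bound evaluated is $(1 - \alpha_t)/\alpha'_t$ from Appendix B.

```python
import math


def alpha(t, nu):
    """Assumed cosine interpolant schedule: 1 - cos^2(pi/2 * t^nu)."""
    return 1.0 - math.cos(0.5 * math.pi * t**nu) ** 2


def dalpha(t, nu):
    """Time derivative of the assumed schedule."""
    u = 0.5 * math.pi * t**nu
    return math.pi * nu * t ** (nu - 1) * math.sin(u) * math.cos(u)


def max_step(t, nu):
    """Upper bound (1 - alpha_t) / alpha'_t on the step size s - t (t > 0)."""
    return (1.0 - alpha(t, nu)) / dalpha(t, nu)


def last_safe_time(nu, n_steps=100):
    """Largest grid time k / n_steps whose Euler step satisfies the bound."""
    dt = 1.0 / n_steps
    safe = 0.0
    for k in range(1, n_steps):  # t = 0 is skipped because alpha'_0 = 0
        t = k * dt
        if max_step(t, nu) >= dt:
            safe = t
    return safe
```

Under this assumed schedule with $\nu = 2$ and 100 steps, `last_safe_time(2.0)` evaluates to 0.98, which agrees with the value read from Figure 4.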

D Model Architecture
FlowMol is implemented using PyTorch and the Deep Graph Library (DGL) [59].
Each node is endowed with a position in space $x_i \in \mathbb{R}^3$, scalar features $s_i \in \mathbb{R}^d$, and vector features $v_i \in \mathbb{R}^{c \times 3}$. Scalar features are initialized at the network input by concatenating the atom type and charge vectors. Vector features are initialized to zeros, $v_i^{(0)} = 0$. Each edge is endowed with scalar edge features that, at the input to the network, are the bond order at time $t$. We enforce that the bond order on both edges for a pair of atoms is identical: $e_{ij} = e_{ji}$.

Molecule Update Block
We define a Molecule Update Block which updates all graph features $x_i, s_i, v_i, e_{ij}$. Each Molecule Update Block is composed of three sub-blocks: a node feature update block, a node position update block, and an edge feature update block. The input molecule graph is passed through $L$ Molecule Update Blocks. Vector features are operated on by geometric vector perceptrons (GVPs). A detailed description of our implementation of GVP is provided in Section D.1.
Node Feature Update Block
The node feature update block performs a graph convolution to update node scalar and vector features $s_i, v_i$. The message-generating and node-update functions for this graph convolution are each chains of GVPs. GVPs accept and return a tuple of scalar and vector features. Therefore, scalar and vector messages $m_{i \to j}$ are generated by a single function $\psi_M$, which is two GVPs chained together. In the update equations (Figure 2, bottom), : denotes concatenation, and $d_{ij}$ is the distance between nodes $i$ and $j$ at molecule update block $l$. In practice, we replace all instances of $d_{ij}$ with a radial basis embedding of that distance before passing it through GVPs or MLPs. Message aggregation and node feature updates are performed as described in [53]. The node update function $\psi_U$ is a chain of three GVPs.
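A radial basis embedding of an interatomic distance can be sketched as follows. The text does not specify the basis, so the Gaussian form, the distance range, the number of centers, and the function names below are all assumptions chosen for illustration.

```python
import math


def rbf_embed(d, d_min=0.0, d_max=20.0, n_centers=16):
    """Embed a scalar distance with Gaussian radial basis functions.

    Centers are evenly spaced on [d_min, d_max] and the width of each
    Gaussian equals the spacing between neighboring centers. All default
    values are illustrative, not the values used by FlowMol.
    """
    spacing = (d_max - d_min) / (n_centers - 1)
    centers = [d_min + k * spacing for k in range(n_centers)]
    return [math.exp(-(((d - c) / spacing) ** 2)) for c in centers]


# Hypothetical pair of node positions x_i and x_j, 5 units apart.
x_i = (0.0, 0.0, 0.0)
x_j = (0.0, 3.0, 4.0)
d_ij = math.dist(x_i, x_j)
embedding = rbf_embed(d_ij)
```

The embedding turns a single scalar into a smooth, fixed-length feature vector whose largest entry corresponds to the center nearest the input distance, which gives downstream MLPs and GVPs a richer input than the raw distance.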
Node Position Update Block
The purpose of this block is to update node positions $x_i$. The position update equation is shown in Figure 2 (bottom).

E Model Ablations on GEOM Dataset

Figure 1: Overview of FlowMol. Top: We adapt the flow matching framework for unconditional 3D molecule generation. An ordinary differential equation parameterized by a graph neural network transforms a prior distribution over atom positions, types, charges, and bond orders to the distribution of valid molecules. Black arrows show the instantaneous direction of the ODE on atom positions. Middle: Trajectory of the atom type vector for a single atom under SimplexFlow, a variant of flow matching developed for categorical variables. Atom type flows lie on the probability simplex. Bottom: Trajectory of an atom type vector starting from a Gaussian prior. This approach does not respect the categorical nature of the data; however, we find it yields superior performance to SimplexFlow.

Figure 2: FlowMol Architecture. Top left: An input molecular graph $g_t$ is transformed into a predicted final molecular graph $g_1$ by being passed through multiple molecule update blocks. Top right: A molecule update block uses NFU, NPU, and EFU sub-components to update all molecular features. Bottom: Update equations for graph features. $\phi$ and $\psi$ are used to denote MLPs and GVPs, respectively.

Figure 4: Left: maximum integration step size to remain on the simplex for a cosine interpolant. Right: zoomed-in view of the asymptotic decline of the maximum step size as $t \to 1$.

Table 2: Comparison of FlowMol to baseline models on the QM9 and GEOM-Drugs datasets

Table 3: FlowMol ablations on GEOM-Drugs with explicit hydrogens

QM9 models are trained with 8 Molecule Update Blocks while GEOM models are trained with 5. Atoms contain 256 hidden scalar features and 16 hidden vector features. Edges contain 128 hidden features. QM9 models are trained for 1000 epochs and GEOM models are trained for 20 epochs. QM9 models are trained on a single L40 GPU with a batch size of 64. GEOM models are trained on 4×L40 GPUs with a per-GPU batch size of 16. QM9 models train in about 3-4 days while GEOM models take 4-5 days. All model hyperparameters are visible in the config files provided in our GitHub repository.
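The training settings stated above can be collected in a small configuration object for side-by-side comparison. The class and field names below are hypothetical and do not correspond to the config files in the FlowMol repository; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container; field names are not from the FlowMol repo."""

    n_update_blocks: int
    epochs: int
    n_gpus: int
    per_gpu_batch_size: int
    n_hidden_scalars: int = 256    # hidden scalar features per atom
    n_hidden_vectors: int = 16     # hidden vector features per atom
    n_hidden_edge_feats: int = 128 # hidden features per edge

    @property
    def effective_batch_size(self) -> int:
        """Total number of molecules per optimization step across GPUs."""
        return self.n_gpus * self.per_gpu_batch_size


QM9 = TrainConfig(n_update_blocks=8, epochs=1000, n_gpus=1, per_gpu_batch_size=64)
GEOM = TrainConfig(n_update_blocks=5, epochs=20, n_gpus=4, per_gpu_batch_size=16)
```

Written this way, it is easy to see that both settings use the same effective batch size of 64 molecules, assuming the per-GPU batches are combined at each step.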